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Sparse  Modeling  Framework 


Pablo  Sprcchmann,1!  Ignacio  Ramirez,1!  Guillermo  Sapiro1  and  Yonina  C.  Eldar2 
University  of  Minnesota  and  2Technion 


Abstract 

Sparse  modeling  is  a  powerful  framework  for  data  analysis  and  processing.  Traditionally,  encoding  in 
this  framework  is  performed  by  solving  an  t\ -regularized  linear  regression  problem,  commonly  referred 
to  as  Lasso  or  basis  pursuit.  In  this  work  we  combine  the  sparsity-inducing  property  of  the  Lasso  model 
at  the  individual  feature  level,  with  the  block-sparsity  property  of  the  Group  Lasso  model,  where  sparse 
groups  of  features  are  jointly  encoded,  obtaining  a  sparsity  pattern  hierarchically  structured.  This  results  in 
the  Hierarchical  Lasso  (HiLasso),  which  shows  important  practical  modeling  advantages.  We  then  extend 
this  approach  to  the  collaborative  case,  where  a  set  of  simultaneously  coded  signals  share  the  same  sparsity 
pattern  at  the  higher  (group)  level,  but  not  necessarily  at  the  lower  (inside  the  group)  level,  obtaining  the 
collaborative  HiLasso  model  (C-HiLasso).  Such  signals  then  share  the  same  active  groups,  or  classes, 
but  not  necessarily  the  same  active  set.  This  model  is  very  well  suited  for  applications  such  as  source 
identification  and  separation.  An  efficient  optimization  procedure,  which  guarantees  convergence  to  the 
global  optimum,  is  developed  for  these  new  models.  The  underlying  presentation  of  the  new  framework 
and  optimization  approach  is  complemented  with  experimental  examples  and  theoretical  results  regarding 
recovery  guarantees  for  the  proposed  models. 


I.  Introduction  and  Motivation 

Sparse  signal  modeling  has  been  shown  to  lead  to  numerous  state-of-the-art  results  in  signal  processing, 
in  addition  to  being  very  attractive  at  the  theoretical  level.  The  standard  model  assumes  that  a  signal  can 
be  efficiently  represented  by  a  sparse  linear  combination  of  atoms  from  a  given  or  learned  dictionary. 

ffi.  S.  and  I.  R.  contributed  equally  to  this  work. 
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The  selected  atoms  form  what  is  usually  referred  to  as  the  active  set ,  whose  cardinality  is  significantly 
smaller  than  the  size  of  the  dictionary  and  the  dimension  of  the  signal. 

In  recent  years,  it  has  been  shown  that  adding  structural  constraints  to  this  active  set  has  value  both  at 
the  level  of  representation  robustness  and  at  the  level  of  signal  interpretation  (in  particular  when  the  active 
set  indicates  some  physical  properties  of  the  signal);  see  [1],  [2],  [3]  and  references  therein.  This  leads 
to  group  or  structured  sparse  coding,  where  instead  of  considering  the  atoms  as  singletons,  the  atoms  arc 
grouped,  and  a  few  groups  are  active  at  a  time.  An  alternative  way  to  add  structure  (and  robustness)  to 
the  problem  is  to  consider  the  simultaneous  encoding  of  multiple  signals,  requesting  that  they  all  share 
the  same  active  set.  This  is  a  natural  collaborative  filtering  approach  to  sparse  coding;  see,  for  example, 
[4],  [5],  [6],  [7],  [8],  [9], 

In  this  work  we  extend  these  models  in  a  number  of  directions.  First,  we  present  a  hierarchical  sparse 
model,  where  not  only  a  few  (sparse)  groups  of  atoms  are  active  at  a  time,  but  also  each  group  enjoys 
internal  sparsity. 1  At  the  conceptual  level,  this  means  that  the  signal  is  represented  by  a  few  groups 
(classes),  and  inside  each  group  only  a  few  members  are  active  at  a  time.  A  simple  example  of  this  is  a 
piece  of  music  (numerous  applications  in  genomics  and  image  processing  exist  as  well),  where  only  a  few 
instruments  are  active  at  a  time  (each  instrument  is  a  group),  and  the  sound  produced  by  each  instrument 
at  each  instant  is  efficiently  represented  by  a  few  atoms  of  the  sub-dictionary/group  corresponding  to 
it.  Thereby,  this  proposed  hierarchical  sparse  coding  framework  permits  to  efficiently  perform  source 
identification  and  separation,  where  the  individual  sources  (classes/groups)  that  generated  the  signal  arc 
identified  at  the  same  time  as  their  representation  is  reconstructed  (via  the  sparse  code  inside  the  group). 
An  efficient  optimization  procedure,  guaranteed  to  converge  to  the  global  optimum,  is  proposed  to  solve 
the  hierarchical  sparse  coding  problems  that  arise  in  our  framework.  Theoretical  recovery  bounds  arc 
derived  for  this  hierarchical  sparse  model  ( HiLasso ). 

Then,  we  go  one  step  beyond  this.  Continuing  with  the  above  example,  if  we  know  that  the  same  few 
instruments  will  be  playing  simultaneously  during  different  passages  of  the  piece,  then  we  can  assume 
that  the  active  groups  at  each  instant,  within  the  same  passage,  will  be  the  same.  We  can  then  exploit  this 
information  by  applying  the  new  hierarchical  sparse  coding  approach  in  a  collaborative  way,  enforcing 
that  the  same  groups  will  be  active  at  all  instants  within  a  passage  (since  they  are  of  the  same  instruments 
and  then  efficiently  representable  by  the  same  sub-dictionaries),  while  allowing  each  group  for  each  music 

'While  we  here  consider  only  2  levels  of  sparsity,  the  proposed  framework  is  easily  extended  to  multiple  levels. 
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instant  to  have  its  own  unique  internal  sparsity  pattern  (depending  on  how  the  sound  of  each  instrument  is 
represented  at  each  instant).  We  propose  a  collaborative  hierarchical  sparse  coding  framework  addressing 
exactly  this  (C-HiLasso).2  An  efficient  optimization  procedure  for  this  case  is  derived  as  well.  For  this 
case,  we  comment  on  results  regarding  the  correct  recovery  of  the  underlying  active  groups. 

Our  proposed  optimization  technique  for  both  sparse  coding  problems  combines  the  Sparse  Reconstruc¬ 
tion  by  Separable  Approximation  (SpaRSA)  [10],  with  the  Alternating  Direction  Method  of  Multipliers 
(ADMOM)  [11],  a  general  puipose  optimization  tool  that  has  been  successfully  employed  to  efficiently 
solve  (i -constrained  regularization  problems  [12],  [13],  [14].  Our  algorithm  iteratively  alternates  between 
a  scalar  thresholding  at  the  coefficient  level  and  a  vector  thresholding  at  the  group  level,  naturally  yielding 
to  the  desired  hierarchical  sparsity  patterns  in  the  solutions  and  converges  to  the  global  optimum. 

The  rest  of  the  paper  is  organized  as  follows:  Section  II  provides  an  introduction  to  traditional  sparse 
modeling  and  presents  our  proposed  HiLasso  and  C-HiLasso  models.  After  introducing  the  models,  we 
discuss  their  relationship  with  the  recent  works  of  [2],  [15],  [16],  [17],  [18],  [19]  is  detailed.  In  Section  III 
we  describe  the  optimization  techniques  applied  to  solve  the  resulting  sparse  coding  problems.  Recovery 
guarantees  for  HiLasso  in  the  noiseless  setting  are  developed  in  Section  IV.  We  also  comment  on  existing 
results  regarding  correct  recovery  of  group-sparse  patterns  in  the  collaborative  case.  Experimental  results 
and  simulations  are  given  in  Section  V,  and  finally  concluding  remarks  are  presented  in  Section  VI. 

II.  Collaborative  Hierarchical  Coding 
A.  Background:  Lasso  and  Group  Lasso 

Assume  we  have  a  set  of  data  samples  Xj  £  Mm, j  =  1  ,...,n,  and  a  dictionary  of  p  atoms  in 
Mm,  assembled  as  a  matrix  D  £  Mmxp,  D  =  [did2  •  • .  dp].  Each  sample  x;  can  be  written  as  x;  = 
Daj  +  e>  a j  G  e  £  Rm,  that  is,  as  a  linear  combination  of  the  atoms  in  the  dictionary  D  plus  some 
perturbation  e,  satisfying  ||e||2  <C  ||xj||9.  The  basic  underlying  assumption  in  sparse  modeling  is  that,  for 
all  or  most  j,  the  “optimal”  reconstruction  a j  has  only  a  few  nonzero  elements.  Formally,  if  we  define 
the  fo  cost  as  the  pseudo-norm  counting  the  number  of  nonzero  elements  of  ay,  |a;/  |:(J  =  | { /;:  :  akj  f  0} | , 
then  we  expect  that  |a;  ||()  and  |  a  j  1 1 0  <C  m  for  all  or  most  j. 

2Note  that  different  recordings  can  also  have  different  instruments,  so  some  of  them  will  share  the  same  groups  while  not 
necessarily  all  of  them  will  be  exactly  the  same. 
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The  i{)  optimization  is  non-convex  and  known  to  be  NP-hard.  To  determine  a j  in  practice,  a  multitude 
of  efficient  algorithms  have  been  proposed,  which  achieve  high  recovery  rates.  The  t\ -minimization 
method  is  the  most  extensively  studied  recovery  technique.  In  this  approach,  the  non-convex  /iq  norm  is 
replaced  by  the  convex  i\  norm,  leading  to 

min  1 1  all -i  s.t.  ||x,-  —  Da||9  <  e.  (II.  1 ) 

agRp  1  J  2 

The  use  of  general  puipose  or  specialized  convex  optimization  techniques  allows  for  efficient  reconstruc¬ 
tion  using  this  strategy.  The  above  approximation  is  known  as  the  Lasso  [20]  or  basis  pursuit  [21],  [22], 
A  popular  valiant  is  to  use  the  unconstrained  version 

1  „  ll9  „  „ 

min  -  ||x;  —  Da|L  +  A  llalL  ,  (II. 2) 

agRp  2  z 

where  A  is  an  appropriate  parameter  value,  usually  found  by  cross-validation. 

The  fact  that  the  || - 1|  1  regularizer  induces  sparsity  in  the  solution  a j  is  desirable  not  only  from  a 
regularization  point  of  view,  but  also  from  a  model  selection  perspective,  where  one  wants  to  identify  the 
relevant  features  or  factors  (atoms)  that  conform  each  sample  Xj.  In  many  situations,  however,  the  goal 
is  to  represent  the  relevant  factors  not  as  singletons  but  as  groups  of  atoms.  For  a  dictionary  of  p  atoms, 
we  define  groups  of  atoms  through  their  indexes,  G  C  {1 , ,p}.  Given  a  group  G  of  atoms  from  a 
dictionary  D,  we  denote  the  subdictionary  formed  by  them  as  Dg,  and  the  corresponding  set  of  linear 

reconstruction  coefficients  as  slg ■  Define  Q  =  {Gi, . . . ,  G\g\}  to  be  a  partition  of  {1, . . .  ,p}.3  In  order 

to  perform  model  selection  at  the  group  level  (relative  to  the  partition  Q),  the  Group  Lasso  problem  was 
introduced  in  [1], 

1  9 

min  -  1 1 xy  —  DaUg  +  Xipg{a),  (II.3) 

where  'ipg  is  the  Group  Lasso  regularizer  defined  in  terms  of  Q  as  ipg  (a)  =  The  function 

ipg  can  be  seen  as  an  l\  norm  on  Euclidean  norms  of  the  vectors  formed  by  coefficients  belonging  to  the 
same  group  ag.  This  is  a  generalization  of  the  i\  regularizer,  as  the  latter  arises  from  the  special  case 
Q  =  {1,  2, . . .  ,p}  (the  groups  are  singletons),  and  as  such,  its  effect  on  the  groups  of  a  is  also  a  natural 
generalization  of  the  one  obtained  with  the  Lasso:  it  “turns  on/off”  atoms  in  groups. 

We  can  always  consider  the  “noiseless”  sparse  coding  problem  minaeRp  {Xip(a)  :  x;  =  Da}, 
for  a  generic  regularizer  ip(-),  as  the  limit  of  the  Lagrangian  sparse  coding  problem 

3While  in  this  paper  we  concentrate  and  develop  the  important  non-overlapping  case,  it  will  be  clear  that  the  concepts  of 
collaborative  hierarchical  sparse  modeling  introduced  here  apply  to  the  case  of  overlapping  groups  as  well. 
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1  nonzero  group  □  = 
nonzero  coefficient 


dictionary 

I -  D 


Fig.  1.  Sparsity  patterns  induced  by  HiLasso  (left)  and  collaborative  HiLasso  (right)  model  selection  programs.  Notice  that  the 
C-HiLasso  imposes  the  same  group-sparsity  pattern  in  all  the  samples  (same  class),  whereas  the  in-group  sparsity  patterns  can 
vary  between  samples  (samples  themselves  are  different). 


minagRp  1 1  || Xj  —  DaUg  +  A,0(a)|  when  A  — >  0.  In  the  remainder  of  this  section,  as  well  as  in 
Section  III,  we  only  present  the  corresponding  Lagrangian  formulations. 


B.  The  Hierarchical  Lasso 

The  Group  Lasso  trades  sparsity  at  the  single-coefficient  level  with  sparsity  at  a  group  level,  while, 
inside  each  group,  the  solution  is  generally  dense.  Let  us  consider  for  example  that  each  group  is  a 
sub-dictionary  trained  to  efficiently  represent,  via  sparse  modeling,  an  instrument  or  a  type  of  image,  or 
a  given  class  of  signals  in  general.  The  entire  dictionary  D  is  then  appropriate  to  represent  all  classes 
of  the  signal  as  well  as  mixtures  of  them,  and  Group  Lasso  will  properly  represent  sparse  mixtures  with 
one  group  or  sub-dictionary  per  class).  At  the  same  time,  since  each  class  is  properly  represented  in  a 
sparse  mode  via  its  corresponding  group  or  sub-dictionary,  we  expect  sparsity  inside  its  groups  as  well 
(this  is  not  achieved  by  group  Lasso,  whose  solutions  are  dense  inside  each  group).  This  will  become 
even  more  critical  in  the  collaborative  case,  where  signals  will  share  groups  because  they  are  of  the  same 
class,  but  will  not  necessarily  share  the  full  active  sets,  since  they  are  not  the  same  signal.  To  achieve  the 
desired  in-group  sparsity,  we  simply  re-introduce  the  i\  regularizer  together  with  the  group  regularize^ 
leading  to  the  proposed  Hierarchical  Lasso  ( HiLasso )  model,4 

1  9 

min  -  ||xj  -  DalL  +  A 2V’e(a)  +  Ai  1 1 a 1 1 ,  .  (II.4) 

aSRp  2 

The  hierarchical  sparsity  pattern  produced  by  the  solutions  of  (II.4)  is  depicted  in  Figure  1  (left).  For 
simplicity  of  the  description,  we  assume  that  all  the  groups  have  the  same  number  of  elements.  The 
extension  to  the  general  case  is  obtained  by  multiplying  each  group  norm  by  the  square  root  of  the 

4We  can  similarly  define  a  hierarchical  sparsity  model  with  £q  instead  of  i\. 
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lambdal =0.125  lambda2=0.25 


f 

1  5 

8 

Fig.  2.  Effect  of  different  combinations  of  Ai  and  A2  on  the  solutions  of  the  HiLasso  coding  problem.  Three  cases  are  given 
in  which  we  want  to  recover  a  sparse  signal  (red  crosses)  ao  by  means  of  the  solution  a  of  the  HiLasso  problem  (blue  dots).  In 
this  example  we  have  two  active  groups  out  of  ten  possible  (the  sub  dictionaries  associated  to  each  group  have  30  atoms)  and 
ao  =  8  (four  non-zero  coefficient  per  active  group).  The  best  estimate  is  shown  in  the  top  left.  As  the  ratio  A2/A1  increases 
(bottom  left),  the  level  sets  of  the  regularizer  ipg(-)  become  rounder,  thus  encouraging  denser  solutions.  This  is  depicted  in  the 
rightmost  figure  for  a  simple  case  of  \Q\  =  1.  Increasing  Ai  again  (bottom  right)  increases  sparsity,  although  here  the  final  effect 
is  too  strong  and  some  non-zero  coefficients  are  not  detected. 


corresponding  group  size.  This  model  then  achieves  the  desired  effect  of  promoting  sparsity  at  the 
group/class  level  while  at  the  same  time  leading  to  overall  sparse  feature  selection. 

The  selection  of  Ai  and  A2  has  an  important  influence  on  the  sparsity  of  the  obtained  solution.  Intuitively 
as  A2/A1  increases,  the  group  constraint  becomes  dominant  and  the  solution  tends  to  be  more  sparse  at 
a  group  level  but  less  sparse  within  groups  (see  Figure  2). 


C.  Collaborative  Hierarchical  Lasso 

In  numerous  applications,  one  expects  that  certain  collections  of  samples  share  the  same  active 
components  from  the  dictionary,  that  is,  that  the  indexes  of  the  nonzero  coefficients  in  a j  are  the  same 
for  all  the  samples  in  the  collection.  Imposing  such  dependency  in  the  i\  regularized  regression  problem 
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gives  rise  to  the  so  called  collaborative  (also  called  “multitask”  or  “simultaneous”)  sparse  coding  problem 
[4],  [8],  [9],  [23], 

More  specifically,  if  we  consider  the  matrix  of  coefficients  A  =  [ai, . . .  ,an]  £  W>xn  associated  with 
the  reconstruction  of  the  samples  X  =  [xi, . . . ,  xn]  £  Wnxn,  the  collaborative  sparse  coding  model  is 
given  by 

1  p  n 

min  -  IIX- DA||2f  + A  V  \\ak  ,  (II.5) 

AeRpx"  2  11  ^  II  2’ 

k= 1 

where  ak  £  Mn  is  the  k-th  row  of  A,  that  is,  the  vector  of  the  n  different  values  that  the  coefficient 
associated  to  the  k- th  atom  takes  for  each  sample  j  =  1, . . . ,  n.  If  we  now  extend  this  idea  to  the  Group 
Lasso,  we  obtain  a  collaborative  Group  Lasso  ( C-GLasso )  formulation, 

min  -  || X  -  DA|| %  +  Xipg(A),  (II.6) 

AeMPXTi  2 

where  'ipg(A)  =  J^pjcg  |  Ar'  j|  F,  being  A°  the  submatrix  formed  by  all  the  rows  belonging  to  group  G. 
This  regularize!'  is  the  natural  extension  of  the  regularizer  in  (II.3)  for  the  collaborative  case. 

In  this  paper  we  are  moving  one  step  forward  and  treat  this  together  with  the  hierarchical  extension 
presented  in  the  previous  section.  The  combined  model  that  we  propose,  C-HiLasso,  is  given  by 

1  n 

min  -  ||X  -  DA||p  +  A2^p(A)  +  V' Ai  Hajl^  .  (II.7) 

AeRj>x"  2  '  1 

3= 1 

The  sparsity  patterns  obtained  with  solutions  to  (II.7)  is  shown  in  Figure  l(right).  The  collaborative 
Group  Lasso  is  a  particular  case  of  our  model  when  Ai  is  zero.  On  the  other  hand,  one  can  obtain 
independent  Lasso  for  each  x*  by  setting  A 9  to  zero.  We  see  that  (II.7)  encourages  all  the  signals  to  share 
the  same  groups  (classes),  while  the  active  set  inside  each  groups  is  signal  dependent.  We  thereby  obtain 
a  collaborative  hierarchical  sparse  model,  with  collaboration  at  the  class  level  (all  signals  collaborate 
to  identify  the  classes),  and  freedom  at  the  individual  levels  inside  the  class  to  adapt  to  each  particular 
signal.  This  new  model  is  particularly  well  suited,  for  example,  when  the  data  vectors  have  missing 
components.  In  this  case  combining  the  information  from  all  the  samples  is  very  important  in  order  to 
lead  to  a  correct  representation  and  model  (group)  selection.  This  can  be  done  by  slightly  changing  the 
data  term  in  (II. 7).  For  each  data  vector  x;/  one  computes  the  reconstruction  error  using  only  the  observed 
elements.  Note  that  the  missing  components  do  not  affect  the  other  terms  of  the  equation.  Examples  will 
be  shown  in  Section  V. 


D.  Relationship  to  Recent  Literature 


A  number  of  recent  works  have  addressed  hierarchy,  grouping  and  collaboration  within  the  sparse 
modeling  community.  We  now  discuss  the  most  closely  related  to  the  proposed  C-HiLasso  model. 

In  [2],  the  authors  propose  a  general  framework  in  which  one  can  define  a  regularization  term  to 
encourage  a  variety  of  sparsity  patterns,  and  provide  theoretical  results  (different  than  the  ones  developed 
here)  for  the  single-signal  case.  The  HiLasso  model  presented  here,  in  the  single  signal  scenario,  can 
be  seen  as  a  particular  case  of  that  model  (were  the  groups  in  [2]  should  be  blocks  and  singletons), 
although  the  particularly  and  important  case  of  hierarchical  structure  introduced  here  is  not  mentioned  in 
that  paper.  In  [15]  the  authors  simultaneously  (see  [14])  proposed  a  model  that  coincides  with  ours  again 
in  the  single-signal  scenario.  None  of  these  approaches  develop  the  collaborative  framework  introduced 
here  nor  the  theoretical  guarantees  we  develop. 

The  recovery  of  mixed  signals  with  1$  optimization  was  addressed  in  [19].  This  model  does  not  include 
block  sparsity  (no  hierarchy),  neither  collaboration.  The  theoretical  results  we  obtain  are  not  present  in 
[19]. 

The  special  case  of  C-HiLasso  when  Ai  =  0,  which  we  refer  to  as  collaborative  Group  Lasso  (C- 
GLasso)  here,  is  investigated  in  [24],  where  a  theoretical  analysis  of  the  signal  recovery  properties  of 
the  model  is  developed.  Collaborative  coding  with  structured  sparsity  has  also  been  used  recently  in 
the  context  of  gene  expression  analysis  [16],  [17].  In  [16],  the  authors  propose  a  model,  that  can  be 
interpreted  as  a  particular  case  of  the  collaborative  approach  presented  here,  in  which  a  set  of  signals 
is  simultaneously  coded  using  a  small  (sparse)  number  of  atoms  of  the  dictionary.  They  modify  the 
classical  collaborative  sparse  coding  regularization  so  that  each  signal  can  use  any  subset  of  the  detected 
atoms.  This  is  equivalent  to  our  model  when  the  groups  have  only  one  element  and  therefore  there  is 
no  hierarchy  in  the  coding.  A  collaborative  model  is  presented  in  [17],  where  signals  sharing  the  same 
active  atoms  arc  grouped  together  in  a  hierarchical  way  by  means  of  a  Lee  structure.  The  regularization 
term  proposed  is  analogous  to  the  one  proposed  in  our  work,  but  it  is  used  to  group  signals  rather  than 
atoms  (features),  having  once  again  no  hierarchical  coding. 

Trees  have  also  been  used  recently  to  learn  dictionaries  [18].  The  tree-based  sparse  coding  is  such  that 
if  a  particular-  learned  atom  is  not  used  in  the  decomposition  of  a  signal,  then  none  of  its  descendants 
(in  terms  of  the  given  tree  structure)  can  be  used. 


To  conclude,  while  particular-  instances  of  the  proposed  C-HiLasso  have  been  recently  reported  in  the 
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literature,  none  of  them  arc  as  comprehensive.  C-HiLasso  includes  both  collaboration,  at  a  block/group 
level,  and  hierarchical  coding.  Such  collaborative  hierarchical  structure  is  novel  and  fundamental  to 
address  new  important  problems  such  as  collaborative  source  identification  and  separation.  The  new 
theoretical  results  presented  here  which  extend  the  block  sparsity  results  of  [3],  [25]  complement  the 
previous  modeling  and  algorithmic  work. 

III.  Optimization 

A.  Single-Signal  Problem:  HiLasso 

In  the  last  decade,  optimization  of  problems  of  the  form  of  (II.2)  and  (II.3)  have  been  deeply  studied,  and 
there  exist  very  efficient  algorithms  for  solving  them.  Recently,  Wright  et.  al  [10]  proposed  a  framework, 
SpaRSA,  for  solving  the  general  problem 

min  /(a)  +  XipiaP).  (III. 8) 

aSRp 

To  guarantee  convergence,  /  :  W'  — >  M  needs  to  be  a  smooth  and  convex  function,  while  ip  :  W  — >  M  only 
needs  to  be  finite  in  LRT.  This  formulation  includes  as  important  particular  cases  the  Lasso,  Group-Lasso 
and  HiLasso  problems  by  setting  /(•)  as  the  reconstruction  error  and  then  choosing  the  corresponding 
regularizes  for  ip(-).  When  the  regularizer,  ip(-),  is  group  separable,  the  optimization  can  be  subdivided 
into  smaller  problems,  one  per  group.  The  framework  becomes  powerful  when  these  subproblems  can  be 
solved  efficiently.  This  is  the  case  of  the  Lasso  and  Group  Lasso  (with  non  overlapping  groups)  settings 
but  is  not  immediate  with  the  HiLasso  regularizer  (II. 4).  In  this  work  we  combine  SpaRSA  with  the 
ADMOM  method  [11]  to  efficiently  solve  the  HiLasso  problem. 

The  SpaRSA  algorithm  generates  a  sequence  of  iterates  {a(/i  }/Gf,  that,  under  certain  conditions, 
converges  to  the  solution  of  (III. 8).  At  each  iteration,  a^+ 1 1  is  obtained  by  solving 

min  (z  —  a^)TV/(a^)  +  z  —  a ^  (III.9) 

for  some  sequence  of  parameters  G  M+,  which  needs  to  be  chosen  properly  for  the 

algorithm  to  converge  (see  [10]  for  details).  It  is  easy  to  show  that  (III.9)  is  equivalent  to 

1  2  \ 

min-  z  —  -| - -r- ib(z),  (III.  10) 

zeR?  2  2  a(f)  W 

where  =  a—  _  -L.  V/(a^).  In  this  new  formulation,  it  is  clear  that  the  first  term  in  the  cost  function 
can  be  separated  element-wise.  Thus,  when  the  regularization  function  'ip(z)  is  group  separable,  so  is  the 
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overall  optimization,  and  one  can  solve  (III.  10)  independently  for  each  group,  leading  to 


1 

min  - 

zgGR|Gi  2 


(t) 

z G  -  U^/ 


2  A  ,  ,  . 

+  ^G(zG), 

2  Q\t> 


zG  being  the  corresponding  variable  for  the  group.  In  the  case  of  HiLasso,  this  becomes, 


1  1 1  -i  1 1 2  |  A  2 

nun  —  b  —  w  0  H - 7-, 

beR'Gi  2  11  112  QW 


-^l|b||9  +  ^||b 


a 


(t) 


11  > 


(III.  11) 


where  w  =  is  the  subvector  of  —  -^jD^(Da^  —  x)  indexed  by  G.  Problem  (III.  11)  is  a 

second  order  cone  programing  (SOCP),  for  which  one  could  use  generic  solvers.  However,  this  subproblem 
needs  to  be  solved  many  times  within  the  SpaRSA  iterations,  so  it  is  crucial  to  solve  it  efficiently. 


To  obtain  an  efficient  implementation  of  SpaRSA,  we  use  the  ADMOM  method  [11],  The  idea  is  to 
solve  the  artificially  constrained  equivalent  problem, 

1  9~ 

min  -  ||b  -  w||;  +  A2  \\/3\\2  +  Ai  ||b|| ,  ,  s.t.  b  =  P, 

beRiGi  2  112  1 

where  A,;  =  \j/a(tK  The  algorithm  generates  a  set  of  iterates  {b^, /3^, p^}teN+  which  converges  to 
the  minimum  of  the  Augmented  Lagrangian 

Lc{  b,  /3,  p)  =  ^ I]  b  -  w||2  +  A2  ||/3||2  +  Ai  ((b^  +  pT(b  ~P)  +  ^\\b-  (3\\22  . 

The  elements  of  p  are  the  so  called  Lagrangian  multipliers,  and  c  >  0  is  the  augmented  Lagrangian 
penalty  term,  which  is  a  parameter  of  the  algorithm.  The  idea  of  adding  a  quadratic  term  |  ||b  —  j3\\\  to 
the  Lagrangian  is  to  render  the  corresponding  primal  cost  function  strictly  convex,  so  that  the  dual  function 
is  differentiable.  This  in  turn  allows  the  optimization  to  be  carried  out  using  primal-dual  iterations,  where 
the  dual  variables,  in  this  case  p,  can  be  updated  efficiently  using  gradient  ascent.  The  method  always 
converges  to  the  optimum,  and  the  parameter  c  needs  to  be  chosen  empirically  in  order  to  obtain  a  good 
convergence  rate. 


Each  iteration  of  the  ADMOM  method  updates  b,  /3  and  p,  one  at  a  time,  while  leaving  the  others 
fixed.  The  updates  of  b  and  /3  are  obtained  via  exact  minimization  of  the  augmented  Lagrangian  in  b  and 
P  respectively,  while  the  update  of  the  Lagrangian  multiplier  p  is  a  gradient  ascent  towards  the  solution 
of  the  dual  problem.  This  leads  to  the  iterations: 

1 

1— 

2 


b(t+b  =argmin^  ||b  -  w||2  +  Ai  ||b|| x  +  bTp  +  ^  ||b  -  , 


1 2 
I2  > 


(III.  12) 


/?(t+1)  =argmin  A2 

0 

p(*+D  =p  +  c(b(t+1)-^+1)). 


I2  PTP  +  7T  b^-P 
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For  convenience  of  notation  we  omitted  the  super-indexes  for  the  iterates  at  step  t,  just  explicitly  indexing 
them  at  step  t  +  1.  For  more  details  on  augmented  Lagrangian  methods  see  [11,  Chapter  3]. 


The  update  for  b  is  separable  into  scalar  subproblems  on  the  coordinates  of  b.  The  optimality  conditions 
on  the  subgradient  of  each  of  these  scalar  problems  leads  to  a  simple  valiant  of  the  well  known  soft- 
thresholding  operator,  S(w{,  A)  =  sgTifm,.)  max  {(),  | —  A}.  We  use  S( w,A)  to  denote  the  vector 
obtained  when  applying  the  soft-thresholding  operator  (with  parameter  A)  to  each  element  of  w.  On 
the  other  hand,  the  update  for  j3  is  not  separable  into  scalar  subproblems.  However,  we  show  now  that  its 
solution  can  be  obtained  in  closed  form  via  vector  shrinkage.  For  this,  we  write  the  optimality  conditions 
on  the  subgradient  of  (III.  12), 

A2<9 \\/3\\2  -  p  -  c(b  -  f5)  B  0  =>•  c/3  +  A2<9  \\/3\\2  -  p  -  cb  9  0, 


and  perform  the  change  of  variables  j5'  =  c/5  and  b'  =  p  +  cb.  Since  the  subgradient  of  ||b||2  is  not 
modified  by  a  constant  scaling  of  the  argument,  we  obtain  an  equivalent  optimality  condition,  /5'  + 
A29  || /^ || 2  —  b'  B  0,  which  is  exactly  the  one  leading  to  the  vector  shrinkage  operator,  Sv  described 
in  [1]  for  the  Group  Lasso  (actually  much  simpler,  since  there  is  no  matrix  multiplication  involved), 


S«(b',A2)  = 
and  (5, 


1  - 


b'.  After  reverting  the  change  of  variables,  we  get  closed  form  updates  for  b 


b  =  S(w+c/5-  p,  Ai)  and  f5  =  -  At,(p+  cb,  A2). 

c+1  c 


The  complete  HiLasso  optimization  algorithm  is  summarized  in  Algorithm  1 .  The  parameter  r/  is  needed 
to  guarantee  the  convergence  of  SpaRSA  and  has  very  little  influence  in  the  overall  performance  (see  [10] 
for  details),  we  used  r]  =  2  in  all  our  experiments.  An  additional  speed  up  is  obtained  by  bypassing 
ADMOM  when  a  whole  group  is  not  active.  From  the  optimality  conditions  of  (III.  11),  it  follows  that  if  0 
is  a  solution  when  Ai  =  0  (standard  Group  Lasso)  or  A2  =  0  (Lasso),  it  is  also  a  solution  in  the  general 
case.  This  can  be  simply  checked  by  evaluating  Sv(  w .  A2)  >  0  and  5  ( w  ,  A  r )  >  0  before  proceeding 
with  the  ADMOM  algorithm. 

It  is  interesting  to  note  that  when  either  Ai  =  0  or  A2  =  0,  the  ADMOM  technique  is  not  needed 
and  the  SpaRSA  subproblem  (III.  11)  is  solved  exactly  in  one  step  using  either  a  soft  thresholding  (when 
A2  =  0)  or  a  vector  thresholding  (when  Ai  =  0).  In  particular,  when  A2  =  0,  the  proposed  optimization 
then  reduces  to  the  Iterative  Soft  Thresholding  algorithm  [26]. 


Input:  Data  X,  dictionary  D,  group  set  Q,  constants  77  >  1,  c  >  0,  0  <  am\a  <  ow 
Output:  The  optimal  point  a* 

Initialize  t  :=  0,  a(0)  :=  0; 
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while  stopping  criterion  is  not  satisfied  do 
choose  e  [am,ttmax]; 
set  uw  :=  a(t)  -  ^V/( a(t)); 
while  stopping  criterion  is  not  satisfied  do 

//Here  we  use  the  group  separability  of  (III.  10)  and  solve  (tll.l  I )  for  each  group 
for  i  :=  1  to  \Q\  do 

if  <S„(w,  A2)  >  0  then 
set  r  :=  0; 

choose  an  initial  p° ,  (3° ,  b° ; 
while  stopping  criterion  is  not  satisfied  do 
b(r+1)  =  +c/?M  -p(r),Ar); 

/3(r+i)  =  i5„(p(r)  +  cffir+1\A2); 

p(r-+i)  =  p(r)  +  c(b(r+l)  _  ^(r+l)). 


else 


end 

end 

set 

end 

set  t  t  +  1  ; 


set  r  : 

=  r  +  1 

end 

set  ag+1) 

:=  br+1 

set 

:=  0; 

:=  ija^; 

end 


Algorithm  1:  HiLasso  optimization  algorithm. 


B.  Optimization  of  the  Collaborative  HiLasso 

We  now  propose  an  optimization  algorithm  to  efficiently  solve  the  collaborative  HiLasso.  The  main 
idea  is  to  include  a  second  use  of  ADMOM  in  order  to  divide  the  overall  problem  into  two  subproblems: 
one  that  breaks  the  multi-signal  problem  into  n  single-signal  i\ -based  sparse  codings,  and  another  that 
treats  the  multi-signal  case  as  a  single  Group  Lasso-like  problem.  In  this  way  we  take  advantage  of  the 
separability  of  each  term. 

We  define  a  constrained  optimization  problem, 

min  -  II X  —  DA II  p+  AiV^||a7- 1| ,  +A2^c(B)  s.t.  A=B. 

A  13  z~TD)  n  X  n  )  r  /  ^  J  1 
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The  ADMOM  iterations  are  given  by  (we  omitted  the  super-index  for  variables  at  iteration  t  for  notational 
convenience). 


A(t+1)  =argmin-  ||X  -  DA||F  +  Ai  llajlli  +  Tr(ATP)  +  -  ||B  -  A"2 


g(t+t) 


A  2 

.  c 

argmm- 
B  2 


F  > 


B  -  A(m) 


+  Tr(BTP)  +  A2^(B), 


pd+i)  =P  +  c(a(*+1)  -  B(t+1)). 


(III.  13) 
(III.  14) 


Solving  for  Ai:/4li;  Problem  (III.  13)  can  be  separated  into  n  single-signal  subproblems  by  updating  one 
column  of  the  matrix  A  at  a  time, 

“j-n,  g  l!x”  Dai II2  +  Pjai  +  \  llaf“  bll2  +  Ai  Hailli  • 

This  problem  can  be  solved  using  the  SpaRSA  framework  following  the  ideas  used  in  the  previous  section. 
In  this  case  the  function  /  on  (III. 8)  would  be  the  first  three  terms  on  the  equation  above  and  'ip  is  the 
standard  l\  regularization  term.  The  idea  is  to  consider  the  first  three  terms  of  the  cost  as  /(•)  in  Equation 
(III. 8).  The  associated  computational  cost  is  equivalent  to  the  one  of  the  Lasso,  since  the  regularizer  is 
the  standard  t\  norm. 


Solving  for  B^t+1^:  The  problem  given  by  (III.  14)  is  group  separable,  as  a  direct  consequence  of  the 
separability  of  'ipg  for  the  case  of  non  overlapping  groups.  Thus,  we  need  to  solve  \Q\  optimization 
problems  of  the  form. 


mm  - 

Bg£1jX"  2 


B  g  -  A  G 


(t+ 1) 


+  Tr(pg+1)B£)  +  A2||BG||F, 


where  A q,  Bg  and  PG  are  the  \C\  x  n  sub- matrices  of  A,  B  and  P  associated  with  the  group  G 
respectively.  We  express  them  as  column  vectors  (each  with  \G\n  components)  by  concatenating  their 
columns,  obtaining  bG,aG  and  pG  respectively,  and  rewrite  the  optimization  problem  in  vectorial  form 
as 


b£S,>  IM2-pGb  +  5 


,(t+ 1) 


-b 


This  problem  is  identical  to  (III.  12)  and  can  be  reduced  to  a  Group  Lasso  problem  by  simply  changing 
variables  and  thus,  it  is  solved  using  vector  thresholding. 


Similarly  to  what  we  obtained  in  the  single  signal  case,  the  developed  optimization  procedure  alternates 
between  a  Lasso-like  problem  (solved  by  an  iterative  shrinkage  procedure,  via  SpaRSA)  applied  to  each 
signal  independently,  and  a  vector  soft-thresholding  at  a  group  level.  The  important  difference  here  is  that 
the  group  thresholding  is  applied  considering  all  the  signals.  Intuitively  this  means  that  for  a  group  to  be 
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treated  as  active,  the  atoms  of  its  corresponding  subdictionary  need  to  be  relevant  for  all  (or  a  significant 
number  of)  the  signals  in  the  set,  which  translates  into  robustness  in  the  model  (class)  selection. 

As  with  models  such  as  Lasso  and  Group  Lasso,  the  optimal  parameters  Ai  and  A2  are  application  and 
data  dependent.  In  some  specific  cases,  closed  form  solutions  exist  for  such  parameters.  For  example,  for 
signal  restoration  in  the  presence  of  noise  and  using  Lasso  (A2  =  0),  the  GSURE  method  gives  a  closed 
form  solution  for  Ai  [27].  As  extending  such  methods  to  C-HiLasso  is  beyond  the  scope  of  this  work, 
we  rely  on  cross-validation  for  the  choice  of  such  parameters.  Our  optimization  technique  also  relies  on 
the  choice  of  the  augmented  Lagrangian  quadratic  penalty  c.  Although  there  is  no  closed  form  solution 
for  an  optimal  value  of  c,  in  practice  the  performance  of  the  algorithms  is  very  robust  to  its  choice,  and 
a  small  coarse  grid  is  enough  to  choose  a  single  suitable  value  that  works  well  for  the  range  of  problems 
in  Section  V. 


IV.  Theoretical  guarantees 

In  our  current  theoretical  analysis,  we  analyze  the  case  of  a  single  measurement  vector  (signal)  x  (we 
comment  on  the  collaborative  case  at  the  end  of  this  section),  and  assume  that  there  is  no  measurement 
noise  or  perturbation,  so  that  x  =  Da.  We  attempt  to  recover  a  by  solving  the  noise-free  HiLasso 
problem: 

min  {Xibgia)  +  (1  —  A)  llall  1  s.t.  x  =  Da}  .  (IV.15) 

aelRp  L  1  J 

Note  that  we  have  replaced  the  two  regularization  parameters  Ai  and  A2  by  a  single  parameter  A,  since 
scaling  does  not  effect  the  optimal  solution.  We  can  always  assume  that  Ai  +  A2  =  1. 

Our  goal  is  to  develop  guarantees  under  which  the  HiLasso  program  of  (IV.15)  will  recover  the  true 
unknown  vector  a.  We  assume  throughout  this  section  that  a  has  group  sparsity  k,  namely,  not  more 
than  k  of  the  group  vectors  a^,  (?£(/,  have  non-zero  norm.  In  addition,  within  each  group,  we  assume 
that  not  more  than  s  elements  are  non  zero,  that  is,  |ac;||o  <  s.  Without  loss  of  generality,  we  further 
assume  that  the  length  of  each  vector  ac  is  equal  to  g. 

For  A  =  1  the  problem  (IV.15)  reduces  to  the  mixed  -^2/^1  problem  formally  studied  in  [3],  [25].  When 
A  =  0,  (IV.15)  becomes  equivalent  to  the  well-known  Lasso,  or  basis  pursuit  algorithm.  Both  cases  have 
been  treated  previously  in  the  literature  and  sufficient  conditions  have  been  derived  on  the  sparsity  levels 
and  on  the  dictionary  D  to  ensure  that  the  resulting  optimization  problem  recovers  the  true  unknown 
vector.  For  example,  in  [3],  [28],  [29]  conditions  are  given  in  terms  of  the  restricted  isometry  property 
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(RIP)  of  D.  An  alternative  line  of  work  [25],  [30],  focused  on  coherence  guarantees,  which  arc  easier  to 
compute.  Here,  we  follow  the  same  spirit  and  consider  coherence  bounds  that  ensure  recovery  using  the 
HiLasso  approach.  We  also  draw  from  [9]  to  briefly  describe  conditions  under  which  the  probability  of 
error  of  recovering  the  correct  groups,  using  the  special  case  of  the  C-HiLasso  with  Ai  =  0  (C-GLasso), 
falls  exponentially  to  0  as  the  number  of  collaborating  samples  n  grows. 

A.  Block-Sparse  Coherence 

We  begin  by  reviewing  previously  proposed  coherence  measures.  For  a  given  dictionary  D,  the 
(standard)  coherence  is  defined  as  //  =  max^y*  | d'/  (ly  | .  where  d,  denotes  the  /th  column  of  the  dictionary 
D.  This  coherence  was  extended  to  the  block-sparse  setting  in  [25],  leading  to  the  definition  of  block 
coherence:5 

Pb  =  max  ~p( Df  Dj), 

9 

where  p(-)  is  the  spectral  norm,  that  is,  p{ Z)  =  Amax(Z7  Z)  with  Amax(W)  denoting  the  largest  eigenvalue 
of  the  positive  semi-definite  matrix  W.  When  g  =  1  (each  block  is  a  singleton),  D,  =  dt,  so  that  as 
expected,  ps  =  P-  While  ps  quantifies  global  properties  of  the  dictionary  D,  local  properties  are 
characterized  by  the  sub-coherence  of  D,  defined  as 

v  =  max  <  max|d^dj|,  dj,dg  6  Dg>  >  .  (IV.16) 

G&Q  ) 

We  define  u  =  0  for  g  =  1.  Clearly,  if  the  columns  of  Dp  arc  orthonormal  for  each  group  C,  then  u  =  0. 
Assuming  the  columns  of  D  have  unit  norm,  it  can  be  easily  shown  that  p,  v  and  ps  all  lie  in  the  range 
[0, 1].  In  addition,  we  can  easily  prove  that  v  <  p  and  pn  <  p.  In  our  setting  a  is  block  sparse,  but  has 
further  internal  structure:  each  subvector  of  a  is  also  sparse.  Therefore,  we  expect  that  an  appropriate 
coherence  measure  will  be  based  on  the  definition  of  block  sparsity,  but  will  further  incorporate  the 
internal  sparsity.  Let  M  =  D7  D  denote  the  Gram  matrix  of  inner  products  of  the  column  of  D.  Then, 
the  standard  block  coherence  p  is  defined  in  terms  of  the  largest  singular  value  of  an  off-diagonal  sub¬ 
block  of  M.  In  a  similar  fashion,  we  will  define  sparse  block  coherence  in  terms  of  sparse  singular 
values.  As  we  will  see,  two  different  definitions  will  play  a  role,  depending  on  where  exactly  the  sparsity 
within  the  block  enters. 

5Each  sub-dictionary  D;  corresponds  to  one  of  the  (non-overlapping)  groups  in  Q  considered  in  HiLasso. 
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To  define  the  sparse  block  coherence  measures,  we  note  that  the  spectral  norm  p( Z)  of  a  matrix  Z 
can  be  defined  as 

p(Z)  =  max  lx1  Zy|  s.t.  ||x|j  =  1,  ||y||  =  1. 

x,y 

Alternatively,  we  can  define  p( Z)  as  above,  via  the  largest  eigenvalue  Amax  of  W  =  Z3  Z:  p( Z)  = 
VQW).  Formally,  for  any  W  A  0, 

Amax(W)  =  rnaxy7  Wy  s.t.  |y||  =  1, 
y 

We  now  develop  sparse  analogs  of  p( Z)  and  Amax(Z7  Z).  As  we  will  see,  the  simple  square-root 
relation  no  longer  holds  in  this  case.  To  define  the  sparse  largest  singular  value,  we  restrict  x  and  y  to 
be  s-sparse  [31]: 

pss{Z)  =  max  |x7  Zy|  s.t.  ||x||  =  1,  ||y||  =  1,  ||x||o  <  s,  ||y||o  <  s.  (IV.  17) 

x,y 

Similarly,  the  largest  sparse  eigenvalue  of  W  =  Zv  Z  is  defined  as  [31],  [32],  [33] 

Amax(W)  =  maxyTWy  s.t.  ||y||  =  1,  ||y||0  <  a.  (IV.  18) 

y 

The  sparse  matrix  norm  is  then  given  by 

PS(  Z)  =  \ZAmax(ZTZ).  (IV.  19) 

Note  that  in  general  ps( Z)  is  not  equal  to  pss( Z).  It  is  easy  to  see  that  pss( Z)  <  ps( Z).  Efficient 
algorithms  for  computing  ps(  Z)  (also  refeiTed  to  in  the  literature  as  sparse  PC  A)  and  pss(  Z)  can  be 
found  in  [31],  [32],  [33].  For  any  matrix  Z,  pss( Z)  =  p]!3  ZI)  and  ps( Z)  =  p(ZI),  where  I  is  a  matrix 
with  s  columns,  which  consist  of  s  nonzero  rows  and  all  remaining  rows  identically  zero.  The  nonzero 
rows  contain  only  one  nonzero  element,  taking  on  the  value  1.  The  location  of  the  ones  are  chosen  to 
maximize  the  corresponding  singular  value.  In  the  definition  of  pss .  the  two  matrices  I  do  not  necessarily 
correspond  to  the  same  support  selection.  However,  in  order  to  not  further  complicate  the  notation,  we 
denote  both  matrices  by  I. 


Using  (IV.  17)  and  (IV.  19),  we  define  two  sparse  block  coherence  measures: 


and 


Pbss  =  max  -pss(Df  Dj), 

g 


Pbs  =  max  -ps(DjDj). 

g 


(IV.  20) 


(IV.21) 
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The  choice  of  scaling  is  to  ensure  that  prf  ■  Pbss  <  /Jr-  We  also  extend  the  notion  of  sub-coherence  to 
account  for  the  internal  sparsity.  Specifically,  we  define 

vs  =  max  -p(lTDf  DA  (IV.22) 

*  9 

where  Dj  denotes  the  g  —  s  columns  of  D,  not  selected  by  I.  In  other  words,  us  measures  the  coherence 
between  the  active  part  of  each  block  and  the  non-active  section. 

The  following  proposition  establishes  some  relations  between  these  new  definitions  and  the  standard 
coherence  measures. 


Proposition  1.  The  sparse  block-coherence  measures  pbss,Pbs  satisfy 

0  <  pbss  <  ~p,  0  <  hbs  <  \  ~P- 

9  V  9 


The  sparse  sub-coherence  satisfies 


0  <  v*  < 


Vs(g  - s) 


(IV.23) 


(IV.  24) 


Proof:  The  inequalities  pbSSi  pns  >  0  follow  immediately  from  the  definition.  We  obtain  the  upper 
bounds  by  rewriting  pss( Z)  and  ps( Z)  and  then  using  the  Guersgorin  disc  theorem. 


(a) 


P " 


5(z)  =  aV^(Itzt!Fz!)  < 


\ 


max 
l 


\eir\  - 


r— 1 


s  max  |  eir  \ 

l,r 


(b) 


5(z)  =  a^(Fztz!)  < 


N 


max 
l 


r=  1 


J2\e'ir\  ^  Jsmaxle^, 


/,r 


(IV.25) 


(IV.  26) 


where  e/r  and  e'lr  arc  the  elements  of  E  =  M^iFM,j  and  E'  =  and  (a),  (6)  arc  a  consequence 

of  Gersgorin’s  disc  theorem.  The  entries  of  My  =  D/  D^  for  i  /  j  have  absolute  value  smaller  than 
or  equal  to  /j,  and  the  size  of  My  is  g  x  g.  Therefore,  |e^|  <  sp?  and  \e'k,  \  <  g/i2.  Substituting  these 
values  into  (IV.25)  and  (IV.26)  concludes  the  proof  of  the  upper  bounds  on  priss  and  pns- 


The  proof  of  us  follows  a  similar  path  where  now  the  size  of  My  =  Dj  D,  is  g  x  (g  —  s). 


B.  Recovery  Proof 

To  formally  state  our  main  recovery  result,  suppose  that  ao  is  a  block  /c-sparse  vector  with  blocks 
of  length  g,  where  each  block  has  sparsity  s,  and  let  x  =  Dao-  Our  theorem  relies  on  the  following 
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definitions,  which  we  will  use  throughout  this  section.  Let  Do  denote  the  matrix  whose  blocks  correspond 
to  the  nonzero  blocks  of  ao,  and  let  Co  be  the  corresponding  coefficient  blocks.  Denote  by  Cq  the  nonzero 
elements  of  Co  (namely,  the  nonzero  values  in  each  block),  and  similarly  let  Dg  indicate  the  respective 
matrices  on  the  support  S  so  that  x  =  Doco  =  DgCg.  We  let  Do  be  the  matrix  which  contains  the 
blocks  of  size  g  of  D  that  are  not  in  Do,  and  let  Dg  contain  the  columns  of  Do  that  are  not  in  Dg. 
Finally,  we  define  D  as  the  matrix  containing  all  columns  not  in  Dg .  In  terms  of  our  other  definitions, 
D  is  a  concatenation  of  Do  and  Dg. 

An  important  observation  that  we  will  rely  on  throughout,  is  that  the  columns  of  Dg  must  be  linearly 
independent  for  any  choice  of  Do  and  S  in  order  to  guarantee  a  unique  sparse  representation  in  our 
problem.  Otherwise,  we  can  have  two  sparse  vectors  Cq  and  cs  that  satisfy  DqCq  =  DqC5  and  no 
algorithm  will  be  able  to  distinguish  between  them.  Under  this  assumption,  (Dg)TDg  is  invertible  and 
we  can  define  the  pseudo-inverse  (Dg)t  =  ((Do)TDq  )-1(Dg  )r.  Equipped  with  these  definitions  we 
can  now  state  our  main  result. 

Theorem  1.  Let  ao  be  a  block  k- sparse  vector  with  blocks  of  length  g,  where  each  block  has  sparsity  s. 
Let  x  =  Dao  for  a  given  matrix  D.  A  sufficient  condition  for  the  HiLasso  algorithm  (IV.  15)  to  recover 
ao  is  that 

Pc((Dg  )tD0)  <  1  (IV.27) 

pc((  Df)tDf)  <  1  (IV.28) 

II  (Dq  )tD||1;1  <  1.  (IV.  29) 

Here  pc{ Z)  =  max,,  ^Tf  p(Z^r),  Z^r  denoting  the  (£,r) th  s  x  p  block  of  Z  where  p  =  g  in  (IV.27)  and 
p  =  g  —  s  in  (IV.28),  and  ||Z||ij  =  maxr  ||zr||i,  where  zr  is  the  rth  column  of  Z. 

Note  that  (IV.27)  and  (IV.28)  imply  that  for  all  r,  pc(Dg[Do]r)  <  1  and  pc(Dg[Dg]r)  <  1.  Lrom 
(IV.29)  we  have  that  ||(Dg)Mr||i  <  1,  for  every  column  dr  of  D.  On  the  other  hand,  when  dr  is  a 
column  from  Dg,  (Dg)Mr  is  a  vector  containing  the  value  1  in  the  location  corresponding  to  dr,  and 
zero  everywhere  else,  thus  ||(Dg)MI.||i  =  1.  Combining  the  last  two  observations  we  conclude  that 

||(Df)tdr||1  <  1,  Vr  (IV. 30) 

We  will  use  this  result  in  the  proof  of  the  theorem. 

The  sufficient  conditions  (IV.27)-(IV.29)  depend  on  Dg  and  therefore  on  the  nonzero  blocks  in  eg, 
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and  the  nonzero  locations  within  the  blocks,  which,  of  course,  arc  not  known  in  advance.  Nonetheless, 
below  we  will  prove  sufficient  conditions  that  ensure  that  (IV.27)-(IV.29)  holds,  which  depend  only  on 
^Bss ,  Hbs ,vs  and  u,  associated  with  the  dictionary  D. 

We  now  prove  Theorem  1. 

Proof:  To  prove  that  (IV.  15)  recovers  the  correct  vector  ao,  let  a'  be  an  alternative  solution  satisfying 
x  =  Da'.  Denote  by  Co  the  blocks  consisting  of  the  nonzero  values  of  ao  and  by  Do  the  corresponding 
columns  of  D.  Similarly,  let  c'  denote  the  blocks  consisting  of  the  nonzero  values  of  a'  and  let  D'  denote 
the  corresponding  columns  of  D.  Then  x  =  DoCo  =  D'c'. 

In  each  block  of  Co  there  are  at  most  s  nonzero  values.  We  denote  by  Cq  the  nonzero  elements  of  Co, 
and  similarly  let  Dq  indicate  the  respective  matrices  on  the  support  S.  Relying  on  the  fact  that  Dq  has 
linearly  independent  columns,  we  can  write 

C$  =  (Dq  )^Dq  Cq  =  QDqCq  ,  (IV.31) 

where  we  denoted  Q  =  (Dq)^  for  brevity.  Noting  that  DqCq  =  DoCo  =  D'c',  (IV.31)  becomes 

=  QD'c'.  (IV.  32) 

To  proceed,  we  separate  D'  into  two  parts:  blocks  that  are  contained  in  Do,  which  we  denote  by  B, 
and  blocks  that  are  not  contained  in  Do,  which  we  denote  as  R.  Thus,  with  appropriate  permutations  of 
the  blocks  of  D'  we  can  write  D'  =  [B  R] .  We  perform  the  same  permutation  on  c'  resulting  in  vectors 
b  and  r  such  that  D'c'  =  Bb  +  Rr.  Since  the  vector  b  corresponds  to  blocks  that  arc  contained  in  Do, 
we  can  further  decompose  b  as  b5  +  bs  where  bs  indicates  the  values  in  b  that  correspond  to  columns 
in  Dq  supported  on  S,  and  bs  correspond  to  the  remaining  columns.  We  similarly  decompose  the  matrix 
B  into  Bs  and  Bs.  Next  we  note  that  QBS  bs  =  Zb5,  where  Z  is  a  ks  x  rs  matrix  consisting  of 
blocks  of  size  s  x  s  that  arc  either  equal  to  the  identity,  or  to  zero.  Here  r  is  the  number  of  blocks  in 
Bs,  namely,  the  number  of  blocks  shared  by  Do  and  D'.  The  blocks  equal  to  the  identity  correspond  to 
shared  blocks.  Substituting  into  (IV.32), 

c0  =  Q(B5b5  +  Rr)  +  Zb5.  (IV.33) 

Therefore,  for  the  groups  sparsity  regularization  term  we  have 


co)  <  il>g( b5)  +  ^(QB5b5)  +  V’q(QRr). 


(IV.  34) 
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We  now  analyze  the  last  two  terms  in  (IV.34).  To  this  end,  we  rely  on  the  following  lemma  [25,  Lemma 
3], 

Lemma  1.  Let  v  be  a  vector  with  1 1  || 2  >  0,  for  all  g.  Then  for  any  matrix  Z  with  appropriate 

dimensions,  ipg( Zv)  <  pc(Z)'0p(v). 

Since  Bs  is  contained  in  Dq  and  R  is  contained  in  Do,  it  follows  from  (IV.27)  and  (IV.28)  that 
pc(QB5)  <  1  and  pc(QR)  <  1.  Combining  this  observation  with  Lemma  1  and  (IV.34)  we  conclude 
that 

Vv(co)  <  V^b5)  +  ^(b3")  +  0p(r)  =  ipg  (c7).  (IV.35) 

The  last  equality  is  a  result  of  the  fact  that  c'  is  a  concatenation  of  b5,  bs  and  r.  Since  Lq(co)  =  Vv(a o) 
and  -0 g(c ')  =  if>g(a')  we  have  that  0g(ao)  <  'i/jg(ar). 

We  now  show  in  a  similar  fashion  that  ||co||i  <  ||c,||i  or,  equivalently,  1 1 ao 1 1 1  <  ||a/||i.  From  [30, 
Theorem  3.3]  we  have  that 

||c0||i  <  ||QD,||1,1||c,||1.  (IV. 36) 

This  result  is  true  as  long  as  there  is  at  least  one  column  in  D'  that  is  not  in  Dq.  But  this  must  be  the 

case  since  by  assumption,  the  columns  of  Dq  are  linearly  independent.  Therefore,  if  D'  and  Dq  arc 

equal,  then  we  must  have  that  Cq  =  c'.  Combining  (IV.36)  with  (IV.30)  we  conclude  that  ||co||i  <  ||c,||i. 
Combining  this  result  with  (IV.35)  we  have, 

Aipg(ao)  +  (1  —  A)  ||ao||i  <  A0q(a  )  +  (1  —  A) 1 1 a  ||i,  (IV.37) 

so  that  ao  has  the  minimal  objective  from  all  possible  solutions  a'  such  that  x  =  Da'.  ■ 


We  conclude  that  we  can  guarantee  recovery  for  every  choice  of  A  as  long  as  (IV.27)-(IV.29)  are 
satisfied.  We  therefore  turn  to  study  these  conditions  in  more  detail. 

Theorem  2.  Let  gsss,  ^Bs,  vs  be  the  sparse  block-coherence  measures  defined  in  (IV.20),(IV.21)  and 
(IV.22),  and  let  u  be  the  sub-coherence  of  the  dictionary  D  defined  by  (IV.  16).  Then  the  conditions  in 
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(IV.27)-(IV.29)  arc  satisfied  if 


_ kgnBs _ 

1  -  (s  -  l)u  -  (k  -  l)gnBss 
kgus 

1  -  (s  -  l)v  -  (k  -  1  )gg,Bss 
ksp 

1  —  (s  —  l)u  —  {k  —  1  )sp 


<  1 
<  1 
<  1. 


We  also  assume  that  all  denominators  arc  positive. 


(IV.  38) 
(IV.  39) 
(IV.  40) 


Recall  from  Proposition  1  that  gpsss  <  -s / / .  Therefore,  the  denominators  in  (IV.38)  and  (IV.39)  arc 
smaller  than  that  in  (IV.40).  However,  there  is  no  general  ordering  between  the  numerators.  In  the  special 
case  in  which  the  individual  dictionaries  D,  consist  of  orthonormal  columns,  v  =  vs  =  0. 

Proof:  We  begin  by  developing  a  bound  on  pc(QDo).  In  [25]  it  is  shown  that  p,,(r  )  is  submultiplicative.6 
Therefore, 

Pci QDo)  <  Pc(((Do)TDq)_1)pc((Dq  )tD0).  (IV.41) 

By  definition, 

Pc((Dq)TD0)  =  max  ]T  p(jT DfD,),  (IV.42) 

a0 

where  Ao  is  the  set  of  indices  i  for  which  Df  is  in  Do-  Every  element  in  the  sum  is  bounded  above  by 
gpss-  Since  Ao  contains  k  indices,  we  conclude  that 

Pc( QD)  <  pc(((Dg  )tDq  )_1)/r£f/iBS-  (IV.43) 

It  remains  to  develop  a  bound  for /?c(((Dq)7Dq)_1).  To  this  end,  we  express  (Dq)7Dq  as  (Dq)tDq  = 
I  +  W,  where  W  is  a  ( ks )  x  ( ks )  matrix  with  blocks  W^r  of  size  s  x  s  such  that  W^r[i,  i]  =  0,  for 
all  i.  This  follows  from  the  fact  that  the  columns  of  D  are  normalized.  Since  Wf  .r  =  [Dq]J  [Dg]r,  for 
all  i  /  r,  and  W,yr  =  [DqJ^Dq],.  —  Is,  we  have 

Pci W)  =  maxV^  p{ Wfr)  <  max  p(Wr  r)  +maxy^  p(W£  r)  (IV. 44) 

t  l^r 

<{s-  l)v  +  (k  —  l)gpBss-  (IV.45) 

By  our  assumptions,  (s  —  l)u  +  (/c  —  l)gp,Bss  <  1-  Therefore,  pc(W)  <  1.  We  next  use  the  following 
result  from  [25]. 

6  A  matrix  norm  is  called  submultiplicative  if  ||ZW|j  <  ||Z||  ||W||  for  all  matrixes  Z,  W  G  RnXri. 
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Lemma  2.  Suppose  that  pc( W)  <  1.  Then  (I  +  W)-1  =  T,kLo(~W)k- 


Using  Lemma  2,  we  have  that 


(a) 


/^(((Dq^Dq  )-1)  =  pc  J>W)fc  <  ]T  (Pc(W)) 


\fc= 0 

1 


k=0 


(b) 

< 


1 


(IV.  46) 


1  -  pc(W)  1  -  (s  -  l)u  -  (k  -  l)gpBss 

Here,  (a)  is  a  consequence  of  pc{ W)  satisfying  the  triangle  inequality  and  being  submultiplicative,  and 
(b)  follows  by  using  (IV.45).  Combining  (IV.46)  with  (IV. 43),  we  obtain 

kgpBs 


Pc(QD0)  < 


1  -  0  -  l)v  -  (k  -  1  )gpBs 


(IV.  47) 


from  which  (IV.  3  8)  follows. 


We  now  use  the  same  technique  to  bound  pc  { QDq ) .  Using  the  same  method  as  above  we  will  get  a 
similar  bound,  with  the  only  difference  being  in  the  term  pc((Dg  )tDq  ).  By  definition, 

Pc(( D(?)TDf)  =  max  V  p(lT^I^l),  (IV.48) 

ieA0  f—? 

where  Dj  indicates  the  columns  of  D,  not  chosen  by  I.  Each  element  p(lT Y)fDf)  is  bounded  by  gvs . 
Since  there  are  k  elements  in  the  sum,  pc((Dq)tDq)  <  kgus  from  which  (IV.39)  follows. 


Finally,  we  use  the  same  ideas  to  bound  1 1 QD 1 1 1,1  and  derive  (IV. 40).  Specifically, 

|M||(D05fD||M. 


|QD||i,i  <  ||((Do)TDg)_1||1.1  ||(Dn)TD| 


(IV.  49) 


Now 

II(Do)TD||i,i  —  max  |dfd,|, 

where  Ao  is  the  set  of  active  dictionary  columns.  Now,  Ao  contains  ks  indices,  so  that 
|| (Dq1)7  D || i;i  <  ksp,  which  allows  us  to  conclude  that  HQDHqi  <  ||((Dq)7  Dq )_1  It  remains 

to  develop  a  bound  on  ||((Dq)tDq )_1||i,i.  To  this  end  we  express  (Dg)2Dg  as  (Dq)tDq  =  I  +  W, 
and  bound  ||W||i.i,  ||W||i.i  <  (s  —  l)v  +  (k  —  1  )sp.  Continuing  as  before  leads  to  (IV. 40).  ■ 


The  above  results  are  for  the  non-collaborative  case.  For  the  collaborative  case  there  exist  results 
that  show  that  both  the  C-Fasso  [9]  and  C-GFasso  [24]  will  recover  the  true  shared  active  set  with  a 
probability  of  error  that  vanishes  exponentially  with  n.  Since  the  in-group  active  sets  arc  not  necessarily 
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equal  for  all  samples  in  X,  C-HiLasso  could  only  help  in  recovering  the  group  sparsity  pattern.  Since 
the  C-GLasso  is  a  special  case  of  C-HiLasso  when  Ai  =  0,  we  can  conjecture  that  when  Ai  >  0,  the 
accuracy  of  the  C-HiLasso  in  recovering  the  correct  groups  can  only  improve  with  larger  n.  Furthermore, 
since  our  results  for  the  non-collaborative  HiLasso  improve  on  those  of  the  non-collaborative  C-GLasso, 
it  is  to  be  expected  that  the  accuracy  of  C-HiLasso,  for  an  appropriate  Ai  >  0,  will  be  better  than  that 
of  C-GLasso. 

As  an  intuitive  explanation  of  why  this  may  happen,  the  proofs  in  [9]  and  [24]  assume  a  continuous 
probability  distribution  on  the  non-zero  coefficients  of  the  signals,  and  give  recovery  results  for  the 
average  case.  On  the  other  hand,  the  in-group  sparsity  assumption  of  C-HiLasso  implies  that  only  s  out 
of  g  samples  will  be  nonzero  within  each  group.  This  implies  that,  for  the  same  group  sparsity  pattern, 
there  will  be  much  less  (exactly  a  fraction  s/g )  non- zero  elements  in  the  possible  signals  compared  to 
the  ones  that  can  occur  under  the  hypothesis  of  C-GLasso.  Since  any  assumed  distribution  of  the  signals 
under  the  in-group  sparsity  hypothesis  has  to  be  concentrated  on  this  much  smaller  set  of  possible  signals, 
they  should  be  easier  to  recover  correctly  from  solutions  to  the  C-HiLasso  program,  compared  to  the 
dense  group  case  of  C-GLasso. 


V.  Experimental  results 

In  this  section  we  show  the  strength  of  the  proposed  HiLasso  and  C-HiLasso  models.  We  start  by 
comparing  our  model  with  the  standard  Lasso  and  Group  Lasso  using  synthetic  data.  We  created  \Q\ 
dictionaries,  Dj,  with  g  =  64  atoms  of  dimension  m  =  64,  with  i.i.d.  Gaussian  entries.  The  columns 
were  normalized  to  have  unit  (2  norm.  Then  we  randomly  chose  two  groups  to  be  active  at  each  time 
(on  all  the  signals).  Sets  of  n  =  200  normalized  testing  signals  were  generated,  one  per  active  group,  as 
lineal-  combinations  of  s  <C  64  elements  of  the  dictionaries,  x*  =  D  a®.  The  mixtures  were  created  by 
summing  these  signals  and  (eventually)  adding  Gaussian  noise  of  standard  deviation  a.  The  generated 
testing  signals  have  a  hierarchical  sparsity  structure  and  while  they  share  groups,  they  do  not  necessarily 
share  the  sparsity  pattern  inside  the  groups. 

We  then  built  a  single  dictionary  by  concatenating  the  sub-dictionaries,  D  =  [Di, . . . ,  D | c; | ] ,  and  used 
it  to  solve  the  Lasso,  group  Lasso,  HiLasso  and  C-HiLasso  problems.  Table  I  summarizes  the  Mean 
Squared  Error  (mse)  and  Hamming  distance  of  the  recovered  coefficient  vectors.  We  observe  that  our 
model  is  able  to  exploit  the  hierarchical  structure  of  the  data  as  well  as  the  collaborative  structure.  Group 
Lasso  selects  in  general  the  correct  blocks  but  it  does  not  give  a  sparse  solution  within  them.  On  the 
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s  =  8 

38.8/22.0 

118.4/318.2 

27.2/19.5 

9.6/16.2 

s  =  12 

120.0/36.2 

116.6/350.4 

70.4/26.5 

41.3/29.1 

s  =  16 

164.1/43.9 

109.3/338.6 

110.0/32.2 

55.1/35.0 

a  =  0.1 

41.7/22.0  117.3/361.6 

33.0/19.8  16.3/13.3 

a  =  0.2 

56.4/21.6  118.2/378.3 

39.9/22.7  24.9  /17.1 

a  =  0.4 

96.5/22.7  137.8/340.3 

65.6/19.5  59.5  /27.4 

II 

oi 

108.0/27.8  191.6/221.7 

100.9/29.8  74.2  /  30.2 

^0 

II 

00 

120.0/36.2  116.6/350.4 

70.4/26.5  41.3  /  29.1 

101  =  12 

103.0/41.8  84.0/447.7 

66.2/26.4  0.37/  29.8 

TABLE  I 

Simulated  signal  results.  In  each  table,  each  2x2  cell  contains,  top  to  bottom,  left  to  right,  the  result 
of  Lasso,  GLasso,  HiLasso  and  C-HiLasso.  For  each  method,  the  separation  error  (multiplied  by  103)  and 
Hamming  distance  are  shown  as  (MSE/Hamming).  In  the  first  case  (left)  we  vary  the  noise  a  while 

KEEPING  \g\  =  8  AND  S  =  8  FIXED.  IN  THE  SECOND  AND  THIRD  CASES  WE  HAVE  a  =  0.  FOR  THE  SECOND  EXPERIMENT 
(CENTER)  WE  FIXED  \Q\  =  8  WHILE  CHANGING  S.  IN  THE  THIRD  CASE  WE  FIX  S  =  12  AND  VARY  THE  NUMBER  OF  GROUPS 

\Q\.  Bold  indicates  the  best  results,  always  obtained  for  the  proposed  models. 


other  hand,  Lasso  gives  a  solution  that  has  nonzero  elements  belonging  to  groups  that  were  not  active 
in  the  original  signal,  leading  to  a  wrong  model/class  selection.  HiLasso  gives  a  sparse  solution  that 
picks  atoms  form  the  correct  groups  but  still  presents  some  minor  mistakes.  For  the  collaborative  case, 
in  all  the  tested  configurations,  no  coefficients  were  selected  outside  the  correct  active  groups,  and  the 
recovered  coefficients  are  consistently  the  best  ones. 

In  all  the  examples,  and  for  each  method,  the  regularization  parameters  were  the  ones  for  which  the 
best  results  where  obtained.  One  can  scale  the  parameter  A2  to  account  for  different  number  of  signals. 
This  situation  is  analogous  to  a  change  in  the  size  of  the  dictionary,  thus,  A2  should  be  proportional  to 
the  square  root  of  the  number  of  signals  to  code. 


We  then  experimented  with  the  USPS  digits  dataset,  which  has  been  shown  to  be  well  represented 
in  the  sparse  modeling  framework  [34],  Here  the  signals  are  vectors  containing  the  unwrapped  gray 
intensities  of  16  x  16  images  (m  =  256).  We  obtained  each  of  the  n  =  200  samples  in  the  testing  data 
set  as  the  mixture  of  two  randomly  chosen  digits,  one  from  each  of  the  two  drawn  set  of  digits.  In  this 
case  we  only  have  ground  truth  at  the  group  level.  We  measure  the  recovery  performance  in  terms  of 

1  ^  1  where  x*  is  the  component  corresponding  to 


xj 

1 

X> 

7 

Jll 

the  “separation  error”  [35],  ^  XX  i  E”=  1 
source  i  in  the  signal  j,  and  x*  is  the  recovered  one. 


Using  the  usual  training-testing  split  for  USPS  we  first  learned  a  dictionary  for  each  digit.  We  then 
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experiment 

Lasso 

Glasso 

HiLasso 

C-GLasso 

C-HiLasso 

APSNR 

Hamm 

APSNR 

Hamm 

APSNR 

Hamm 

APSNR 

Hamm 

APSNR 

Hamm 

1  digit 

0.06 

0.43 

0.07 

0.78 

0.02 

0.19 

0.01 

0.02 

0.02 

0.06 

1  digit+n 

0.08 

1.31 

0.08 

0.87 

0.04 

0.48 

0.05 

0.25 

0.02 

0.01 

2  digit 

0.09 

1.46 

0.08 

1.86 

0.02 

1.18 

0.01 

0.74 

0.02 

0.90 

2  digit+n 

0.11 

2.21 

0.08 

1.99 

0.04 

1.46 

0.09 

1.60 

0.03 

0.70 

TABLE  II 


Noisy  digit  mixtures  results.  Four  different  cases  are  shown:  when  each  signal  is  a  single  digit  and 

WHEN  IT  IS  THE  MIXTURE  OF  TWO  DIFFERENT  (RANDOMLY  SELECTED)  DIGITS,  WITH  AND  WITHOUT  ADDITIVE  GAUSSIAN 
NOISE  WITH  STANDARD  DEVIATION  10%  OF  THE  PEAK  VALUE.  FOR  THE  2  DIGITS  CASE,  RESULTS  ARE  THE  AVERAGE  OF  8 
RUNS  (IN  EACH  ROUND  A  NEW  PAIR  OF  DIGITS  WAS  RANDOMLY  SELECTED).  IN  THE  SINGLE  DIGIT  CASE,  THE  RESULT  IS 
THE  AVERAGE  OF  THE  TEN  POSSIBLE  SITUATIONS.  WITHOUT  NOISE,  BOTH  C-GLASSO  AND  C-HlLASSO  YIELD  VERY  GOOD 

results.  However,  in  the  noisy  case,  C-HiLasso  is  clearly  superior,  showing  the  advantage  of  adding 

REGULARIZATION  INSIDE  THE  GROUPS  FROM  A  ROBUSTNESS  PERSPECTIVE.  SEE  ALSO  FIGURE  3. 
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Fig.  3.  In  this  example  we  used  C-HiLasso  to  analyze  mixtures  where  the  data  set  contains  different  number  and  types  of 
sources/classes.  We  used  a  set  containing  180  mixture  of  digit  images.  The  first  150  images  are  obtained  as  the  sum/mixture  of 
a  number  “3”  and  an  number  “5”  (randomly  selected).  Each  of  the  last  30  images  in  the  set  are  the  mixture  of  three  numbers: 
“3”  ,“5”  and  “7”  (the  180  images  are  of  course  presented  at  random,  the  algorithm  is  not  apriori  aware  which  images  contain  2 
sources  and  which  contain  3).  The  figure  shows  the  active  sets  of  the  recovered  coefficients  matrix  A  as  a  binary  matrix  the  same 
size  as  A  (atom  indexes  in  the  vertical  and  sample  indexes  in  the  horizontal),  where  black  dots  indicate  nonzero  coefficients. 
C-HiLasso  managed  to  identify  the  active  blocks  while  the  sub-dictionary  corresponding  to  “7”  is  mostly  active  for  the  last  30 
images.  The  accuracy  of  this  result  depends  on  the  relationship  between  the  sub-dictionaries  corresponding  to  each  digit. 


created  a  single  dictionary  by  concatenating  them.  In  Table  II  we  show  the  separation  error  obtained  while 
summing  two  different  numbers.  We  also  consider  the  situation  were  only  on  digit  is  present.  C-HiLasso 
automatically  detects  the  number  of  sources  while  achieving  the  best  recovery  performance.  As  in  the 
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synthetic  case,  only  the  collaborative  method  was  able  to  successfully  detect  the  true  active  classes.  In 
Figure  3  we  relax  the  assumption  that  all  the  signals  have  to  contain  exactly  the  same  type  and  amount 
of  classes  in  the  mixture,  further  demonstrating  the  flexibility  of  the  proposed  C-HiLasso  model. 

We  also  used  the  digits  dataset  to  experiment  with  missing  data.  We  randomly  discarded  an  average 
of  60%  of  the  pixels  per  mixed  image  and  then  applied  C-Hilasso.  The  algorithm  is  capable  of  correctly 
detecting  which  digits  are  present  in  the  images.  Some  example  results  for  this  case  arc  shown  in  Figure  4. 
Note  that  this  is  a  quite  different  problem  than  the  one  commonly  addressed  in  the  matrix  completion 
literature.  Flere  we  do  not  aim  to  recover  signals  that  all  belong  to  a  unique  unknown  sub-space,  but 
signals  that  arc  the  combination  of  two  non-unique  spaces  to  be  automatically  identified  from  the  available 
dictionary.  Such  unknown  spaces  have  common  models/groups  for  all  the  signals  in  question  (the  coarse 
level  of  the  hierarchy),  but  not  necessarily  the  exact  same  atoms  inside  the  groups  and  therefore  not 
necessarily  belong  to  the  same  sub-spaces.  Both  levels  of  the  hierarchy  are  automatically  detected,  e.g., 
that  the  groups  are  those  corresponding  to  “3”  and  “5,”  and  the  corresponding  reconstructing  atoms  (sub¬ 
spaces)  in  each  group,  these  last  ones  possibly  different  for  each  signal  in  the  set.  While  we  consider  that 
the  possible  sub-spaces  arc  to  be  selected  from  the  provided  dictionary  (learned  off-line  from  training 
data),  in  Section  VI  we  discuss  learning  such  dictionaries  as  part  of  the  optimization  as  well  (see  also 
[36]).  In  such  case,  the  standard  matrix  completion  problem  becomes  a  particular  case  of  the  C-HiLasso 
framework  (with  a  single  group  and  all  the  signals  having  the  same  active  set,  sub-space,  in  the  group), 
naturally  opening  numerous  theoretical  questions  for  this  new  more  general  model.7 

Finally,  we  compared  the  performance  of  C-HiLasso,  Lasso,  Group  Lasso  (GLasso)  and  C-GLasso 
(without  hierarchy)  in  the  task  of  separating  mixed  textures  in  an  image.  We  chose  8  textures  from  the 
Brodatz  dataset  and  trained  one  dictionary  for  each  one  of  them  (these  form  the  8  groups  of  the  dictionary). 
Then  we  created  an  image  as  the  sum  of  two  textures  (the  testing  images  were  not  used  in  the  training 
stage).  One  can  think  of  this  experiment  as  a  generalization  to  the  texture  separation  problem  proposed 
in  [35]  (without  additive  noise),  where  only  two  textures  are  present.  The  experiment  was  repeated  for 
all  possible  combinations  of  two  textures  from  the  8  possible  ones,  and  the  results  arc  summarized  in 
Table  III.  A  detailed  example  is  shown  in  Figure  5.  For  each  algorithm,  the  best  parameters  were  chosen 
using  grid  search,  ensuring  that  those  were  not  in  the  edges  of  the  grid.  For  Lasso  and  C-HiLasso  the  best 
Ai  is  0.0625.  For  GLasso  and  C-GLasso,  the  best  A2  was,  respectively,  0.05  and  75.  From  Table  III  we 

7Prof.  Carin  and  collaborators  have  new  results  on  the  case  of  a  single  group  and  signals  in  possible  different  sub-spaces  of 
the  group,  an  intermediate  model  between  standard  matrix  completion  and  C-HiLasso  (personal  communication). 


27 


Fig.  4.  Example  of  recovered  digits  (3  and  5)  from  a  mixture  with  60%  of  missing  components.  From  left  to  right:  noiseless 
mixture,  observed  mixture  with  missing  pixels  highlighted  in  red,  recovered  digits  3  and  5,  and  active  set  recovered  for  all 
samples  using  the  C-HiLasso  and  Lasso  respectively.  In  the  last  two  figures,  the  active  sets  are  represented  as  in  Figure  3.  The 
coefficients  corresponding  to  the  subdictionaries  for  digits  3  and  5  are  marked  as  pink  bands.  Notice  that  the  C-HiLasso  exploits 
efficiently  the  hypothesis  of  collaborative  group-sparsity,  succeeding  in  recovering  the  correct  active  groups  in  all  the  samples. 
The  Lasso,  which  lacks  this  prior  knowledge,  is  clearly  not  capable  of  doing  so,  and  active  sets  spread  all  over  the  groups. 


Fig.  5.  Texture  separation  results.  Left  to  right:  sample  mixture,  corresponding  C-HiLasso  separated  textures,  and  comparison 
of  the  active  set  diagrams  obtained  by  the  Lasso  (as  in  Figure  4).  The  one  for  Lasso  is  shown  on  top,  where  all  groups  are 
wrongly  active  ,  and  the  one  for  C-HiLasso  on  bottom,  showing  that  only  the  two  correct  groups  are  selected. 


can  conclude  that  the  C-HiLasso  is  significantly  better  than  the  competing  algorithms,  both  in  PSNR  of 
the  recovered  signals  (we  show  the  average  PSNR  of  recovering  both  active  signals),  and  in  the  average 
Hamming  distance  between  the  recovered  group-wise  active  sets  and  the  true  ones.  In  the  latter  case  we 
observe  that,  in  many  cases,  the  C-HiLasso  active  set  recovery  performance  is  perfect  (Hamming  distance 
0)  or  near  perfect,  whereas  the  other  methods  seldom  approach  a  Hamming  distance  lower  than  1. 

VI.  Discussion 

We  introduced  a  new  framework  of  collaborative  hierarchical  sparse  coding,  where  multiple  signals 
collaborate  in  their  encoding,  sharing  code  groups  (models)  and  having  (possible  disjoint)  sparse 
representations  inside  the  corresponding  groups.  An  efficient  optimization  approach  was  developed,  which 
guarantees  convergence  to  the  global  minimum,  and  examples  illustrating  the  power  of  this  framework 
were  presented.  At  the  practical  level,  we  are  currently  working  on  the  applications  of  this  proposed 
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2.80  0.42 
1.36  0.00 


0.33  0.25 
2.06  0.00 


0.96  0.01 
1.97  0.00 
1.02  1.00 
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24.2 

18.9 
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22.8 
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2.42 

1.42 
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0.00 

1.00 
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2.25 

2.85 


1.00 

0.35 


27.2 

23.3 

20.4 
20.0 


23.3 

27.5 

20.8 

21.1 


24.6 

20.8 


22.1 

25.4 


23.1 

20.9 


21.4 

22.6 


20.7 

18.8 
17.2 
15.9 


17.6 

22.9 

16.3 

18.5 


19.8 

16.7 


19.5 

22.1 


19.1 
16.5 
20.7 

19.2 


18.4 

20.1 

21.2 

22.3 


19.7 

19.9 

16.2 

16.1 


13.5 
23.8 

16.6 
17.5 


17.9 

17.0 


18.5 

19.7 


17.4 

16.7 
19.2 

19.7 


18.3 

19.7 

20.6 

21.5 


31.5 

25.7 

21.7 
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23.7 

34.9 

19.8 
27.3 


26.8 

19.9 


20.3 

30.0 


25.8 
20.7 
28.3 

23.9 


20.5 

30.0 

22.0 

30.4 


4.12 

3.23 


0.53 

0.82 


3.48 

3.54 


0.44 

0.20 


3.49 

3.11 


0.32 

0.01 


3.16 

4.07 


1.00 

0.40 


16.4 

16.1 


16.2 

17.9 


22.5 

19.3 


20.2 

25.7 


4.37  1.39 
2.51  0.02 
0.09  0.98 
0.53  0.00 


4.47 

2.39 

3.77 

1.75 


0.08 

0.22 

1.00 

0.01 


4.09 

2.42 

0.31 

2.04 


0.13 

0.02 

1.00 

0.00 


4.23 

2.76 

1.83 

1.82 


0.12 

0.02 

1.00 

0.00 


4.20 

2.24 

1.13 

2.18 


1.00 

0.20 

1.00 

0.00 


4.42  0.42 
2.96  0.11 
3.14  0.97 
3.04  0.24 


20.0 

19.9 


19.5 

22.9 


4.30 

1.90 


1.00 

0.18 


TABLE  III 

Texture  separation  results.  The  rows  and  columns  indicate  the  active  textures  in  each  cell.  The  upper 

TRIANGLE  CONTAINS  THE  PSNR  RESULTS,  WHILE  THE  LOWER  TRIANGLE  SHOWS  THE  HAMMING  ERROR  IN  THE 
GROUP-WISE  ACTIVE  SET  RECOVERY.  WITHIN  EACH  CELL,  RESULTS  ARE  SHOWN  FOR  THE  LASSO  (TOP  LEFT),  GROUP 

Lasso  (bottom  left),  Collaborative  Group  Lasso  (top  right)  and  Collaborative  Hierarchical  Lasso 

(BOTTOM  RIGHT).  THE  BEST  RESULTS  ARE  IN  BLUE  BOLD. 


framework  in  a  number  of  directions,  including  collaborative  instruments  separation  in  music,  signal 
classification,  and  speaker  recognition,  following  the  here  demonstrated  capability  to  collectively  select 
the  correct  groups/models. 

At  the  theoretical  level,  a  whole  family  of  new  problems  is  opened  by  this  proposed  framework,  some 
of  which  we  already  addressed  in  this  work.  A  critical  one  is  the  overall  capability  of  selecting  the  correct 
groups  in  the  collaborative  scenario,  with  missing  information,  and  thereby  of  performing  correct  model 
selection  and  source  identification  and  separation.  Results  in  this  direction  will  be  reported  in  the  future. 

Finally,  we  have  also  developed  an  initial  framework  for  learning  the  dictionary  for  collaborative 
hierarchical  sparse  coding,  meaning  the  optimization  is  simultaneously  on  the  dictionary  and  the  code. 
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As  it  is  the  case  with  standard  dictionary  learning,  this  is  expected  to  lead  to  significant  performance 
improvements  (see  [34]  for  the  particular  case  of  this  with  a  single  group  active  at  a  time). 

Acknowledgments:  Work  partially  supported  by  NSF,  NSSEFF,  ONR,  NGA,  and  ARO.  We  thank  Dr.  Tristan 
Nguyen,  when  we  presented  him  this  model,  he  motivated  us  to  think  in  a  hierarchical  fashion  and  to  look  at  this 
as  just  the  particular  case  of  a  fully  hierarchical  sparse  coding  framework.  We  thank  Prof.  Tom  Luo  and  Gonzalo 
Mateos  for  invaluable  help  and  advice  on  the  use  of  admom.  Finally,  we  thank  Prof.  Larry  Carin,  Dr.  Guoshen  Yu 
and  Alexey  Castrodad  for  very  stimulating  conversations,  and  for  the  fact  that  their  own  work  (for  LC,  GY,  and 
AC)  also  motivated  in  part  the  example  with  missing  information. 


References 

[1]  M.  Yuan  and  Y.  Lin,  “Model  selection  and  estimation  in  regression  with  grouped  variables,”  J.  Royal  Stat.  Society,  Series 
B,  vol.  68,  pp.  49-67,  2006. 

[2]  R.  Jenatton,  J.  Audibert,  and  F.  Bach,  “Structured  variable  selection  with  sparsity-inducing  norms,”  Tech.  Rep. 
arXiv:0904.3523vl,  INRIA,  2009. 

[3]  Y.  C.  Eldar  and  M.  Mishali,  "Robust  recovery  of  signals  from  a  structured  union  of  subspaces,”  IEEE  Trans.  Inform. 
Theory,  vol.  55.  no.  11,  pp.  5302-5316,  Nov.  2009. 

[4]  J.  Tropp,  “Algorithms  for  simultaneous  sparse  approximation,  part  II:  Convex  relaxation,”  Signal  Processing,  vol.  86,  no. 
3,  pp.  589-602,  2006. 

[5]  J.  Tropp.  A.  Gilbert,  and  M.  Strauss,  “Algorithms  for  simultaneous  sparse  approximation,  part  I:  Greedy  pursuit,”  Signal 
Processing,  vol.  86,  no.  3,  pp.  572-588,  2006. 

[6]  S.  Cotter,  B.  Rao,  K.  Engan,  and  K.  Kreutz-Delgado,  “Sparse  solutions  to  linear  inverse  problems  with  multiple  measurement 
vectors,”  IEEE  Trans.  Sig.  Proc.,  vol.  53,  no.  7,  pp.  2477-2488,  July  2005. 

[7]  J.  Chen  and  X.  Huo,  “Theoretical  results  on  sparse  representations  of  multiple-measurement  vectors,”  IEEE  Trans.  Sig. 
Proc.,  vol.  54,  no.  12.  pp.  4634-4643,  Dec.  2006. 

[8]  M.  Mishali  and  Y.  C.  Eldar,  “Reduce  and  boost:  Recovering  arbitrary  sets  of  jointly  sparse  vectors,”  IEEE  Trans.  Sig. 
Proc.,  vol.  56,  no.  10.  pp.  4692-4702,  Oct.  2008. 

[9]  Y.  C.  Eldar  and  H.  Rauhut.  “Average  case  analysis  of  multichannel  sparse  recovery  using  convex  relaxation,”  to  appear  in 
IEEE  Trans,  on  Inform.  Theory. 

[10]  S.  Wright,  R.  Nowak,  and  M.  Figueiredo,  “Sparse  reconstruction  by  separable  approximation,”  IEEE  Trans.  Sig.  Proc., 
vol.  57,  no.  7,  pp.  2479-2493,  2009. 

[11]  D.  Bertsekas  and  J.  Tsitsiklis,  Parallel  and  Distributed  Computation:  Numerical  Methods,  Prentice  Hall,  1989. 

[12]  T.  Goldstein  and  S.  Osher,  “The  split  Bregman  method  for  11  regularized  problems,”  SIAM  J.  Imaging  Sciences,  vol.  2, 
no.  2,  pp.  323-343,  2009. 

[13]  J.-A.  Bazerque,  G.  Mateos,  and  G.  Giannakis,  "Distributed  lasso  for  in-network  linear  regression,”  in  Proc.  ICASSP,  Mar. 

2010. 

[14]  P.  Sprechmann,  I.  Ramirez,  and  G.  Sapiro,  “Collaborative  hierarchical  sparse  modeling,”  in  CISS,  Mar.  2010. 

[15]  J.  Friedman.  T.  Hastie.  and  R.  Tibshirani,  “A  note  on  the  group  lasso  and  a  sparse  group  lasso,”  preprint  (2010),  available 

at  http : / / www- st at . Stanford . edu/~tibs. 


30 


[16]  J.  Peng,  J.  Zhu,  A.  Bergamaschi,  W.  Han,  D.  Noh.  J.  Pollack,  and  P.  Wang,  “Regularized  multivariate  regression  for 
identifying  master  predictors  with  application  to  integrative  genomics  study  of  breast  cancer,”  To  appear  in  Annals  of 
Applied  Statistics. 

[17]  S.  Kim  and  E.  P.  Xing,  “Tree-guided  group  lasso  for  multi-task  regression  with  structured  sparsity,”  in  ICML ,  June  2010. 

[18]  R.  Jenatton,  J.  Mairal,  G.  Obozinski,  and  F.  Bach,  “Proximal  methods  for  sparse  hierarchical  dictionary  learning,”  in 
ICML ,  June  2010. 

[19]  J.  Starck,  M.  Elad,  and  D.  Donoho,  “Image  decomposition  via  the  combination  of  sparse  representations  and  a  variational 
approach,”  IEEE  Trans.  Image  Proc.,  vol.  14,  pp.  1570-1582,  2004. 

[20]  R.  Tibshirani,  “Regression  shrinkage  and  selection  via  the  LASSO,”  J.  Royal  Stat.  Society:  Series  B.  vol.  58,  no.  1.  pp. 
267-288,  1996. 

[21]  S.  Chen,  D.  Donoho,  and  M.  Saunders,  “Atomic  decomposition  by  basis  pursuit,”  SIAM  J.  Scientific  Computing,  vol.  20, 
no.  1,  pp.  33-61,  1999. 

[22]  D.  Donoho,  “Compressed  sensing,”  IEEE  Trans,  on  Inf.  Theory,  vol.  52,  no.  4,  pp.  1289-1306,  Apr  2006. 

[23]  B.  Turlach,  W.  Venables,  and  S.  Wright,  “Simultaneous  variable  selection,”  Technometrics,  vol.  27,  pp.  349-363,  2004. 

[24]  P.  Boufounos,  G.  Kutyniok,  and  H.  Rauhut,  “Sparse  recovery  from  combined  fusion  frame  measurements,” 
arXiv:0912.4988vl. 

[25]  Y.  C.  Eldar,  P.  Kuppinger,  H.,  and  Bolcskei,  “Compressed  sensing  of  block-sparse  signals:  Uncertainty  relations  and 
efficient  recovery,”  to  appear  in  IEEE  Trans.  Sig.  Proc. 

[26]  I.  Daubechies,  M.  Defrise,  and  C.  De  Mol,  “An  iterative  thresholding  algorithm  for  linear  inverse  problems  with  a  sparsity 
constraint.,”  Comm,  on  Pure  and  Applied  Mathematics,  vol.  57,  pp.  1413-1457,  2004. 

[27]  R.  Giryes,  M.  Elad,  and  Y.  C.  Eldar,  “The  projected  GSURE  for  automatic  parameter  tuning  in  iterative  shrinkage  methods,” 
Submitted  to  Applied  and  Computational  Harmonic  Analysis. 

[28]  E.  Candes,  J.  Romberg,  and  T.  Tao,  “Robust  uncertainty  principles:  Exact  signal  reconstruction  from  highly  incomplete 
frequency  information,”  IEEE  Trans.  Inform.  Theory,  vol.  52,  no.  2,  pp.  489-509,  Feb.  2006. 

[29]  E.  Candes,  “The  restricted  isometry  property  and  its  implications  for  compressed  sensing,”  C.  R.  Acad.  Sci.  Paris  S’er.  I 
Math.,  vol.  346,  pp.  589-592,  2008. 

[30]  J.  Tropp,  “Greed  is  good:  Algorithmic  results  for  sparse  approximation,”  IEEE  Trans.  Inform.  Theory,  vol.  50,  no.  10,  pp. 
2231-2242,  Oct.  2004. 

[31]  A.  d’Aspremont,  L.  El  Ghaoui,  M.  Jordan,  and  G.  Lanckriet,  “A  direct  formulation  for  sparse  PCA  using  semidefinite 
programming,”  Neural  Information  Processing  Systems,  vol.  17,  2004. 

[32]  H.  Zou,  T.  Hastie,  and  R.  Tibshirani,  “Sparse  principal  component  analysis,”  Journal  of  Computational  and  Graphical 
Statistics,  vol.  15,  no.  2,  2003. 

[33]  B.  Moghaddam,  Y.  Weiss,  and  S.  Avidan,  “Spectral  bounds  for  sparse  PCA:  Exact  &  greedy  algorithms,”  Neural  Information 
Processing  Systems,  vol.  18,  2006. 

[34]  1.  Ramirez,  P.  Sprechmann,  and  G.  Sapiro,  “Classification  and  clustering  via  dictionary  learning  with  structured 
incoherence,”  in  CVPR,  June  2010. 

[35]  N.  Shoham  and  M.  Elad,  “Alternating  KSVD-denoising  for  texture  separation,”  in  The  IEEE  25-th  Convention  of  Electrical 
and  Electronics  Engineers  in  Israel,  2008. 

[36]  M.  Zhou,  H.  Chen,  J.  Paisley,  L.  Ren,  L.  Li,  Z.  Xing,  D.  Dunson,  G.  Sapiro,  and  Lawrence  Carin,  “Non- 
parametric  bayesian  dictionary  learning  for  analysis  of  noisy  and  incomplete  images,”  IMA  Preprint,  April  2010, 

http : / / www . ima . umn . edu /preprints/ apr2010/2307 .pdf. 


